I’m using the same dataset from prior assignments, “Differential effect of SARS-CoV-2 spike glycoprotein 1 on human bronchial and alveolar lung mucosa models: Implications on pathogenicity.”, published December 13, 2021. The corresponding publication can be found on GEO and in the references.

This experiment compares the expression response of human alveolar and bronchial cells to SARS-CoV-2 spike protein through RNA sequencing. Figure 1 from the paper shows the experimental design.

Figure 1. Experiment design. Rahman et al, 2021.

Non-thresholded Gene set Enrichment Analysis

I will be using GSEA 4.1.0 from Docker image risserlin/em_base_image:version_with_bioc_3_13 per lectures and a prior homework. I will be using the current April release of a geneset from the Gary Bader lab for enrichment analysis, containing GO biological processes and pathways but no IEA, Human_GOBP_AllPathways_no_GO_iea_April_01_2022_symbol.gmt

This geneset uses HGNC symbols, so construct the ranked list for GSEA accordingly.

rnk <- output_hits[output_hits$gene != "" & !is.na(output_hits$gene),]
rnk$rank <- -log10(rnk$P.Value) * sign(rnk$logFC)
# See https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats#Ranked_Gene_Lists
write.table(rnk[,c("gene", "rank")], "GSEA_GSE185657.rnk", sep="\t", quote=FALSE, row.names=FALSE, col.names=FALSE)

The following commands were run in a shell due to instabilility with RStudio in the Docker image used. The parameters match that used in lectures and the prior homework.

host$ docker run --rm -it --user=rstudio -v $(pwd):/home/rstudio/projects risserlin/em_base_image:version_with_bioc_3_13 bash
rstudio@container$ cd ~/GSEA_4.1.0
rstudio@container$ ./gsea-cli.sh GSEAPreRanked -gmx ../projects/Human_GOBP_AllPathways_no_GO_iea_April_01_2022_symbol.gmt -rnk ../projects/GSEA_GSE185657.rnk -collapse false -nperm 1000 -scoring_scheme weighted -rpt_label A3 -plot_top_x 20 -rnd_seed 12345 -set_max 200 -set_min 15 -zip_report false -out ../projects/ > ../projects/gsea_output

Enrichment results summary

All results have been uploaded to GitHub with the main summary here. They are transcribed below.

312 gene sets are significantly upregulated at nominal pvalue < 5%. Top results:

  1. HALLMARK_INTERFERON_ALPHA_RESPONSE (MSIGDB)
  2. INTERFERON ALPHA BETA SIGNALING (REACTOME)
  3. HALLMARK_INTERFERON_GAMMA_RESPONSE (MSIGDB)
  4. DEFENSE RESPONSE TO SYMBIONT (GOBP)
  5. DEFENSE RESPONSE TO VIRUS (GOBP)

350 gene sets are significantly downregulated at nominal pvalue < 5%. Top results:

  1. ELECTRON TRANSPORT CHAIN: OXPHOS SYSTEM IN MITOCHONDRIA (WIKIPATHWAYS)
  2. EUKARYOTIC TRANSLATION ELONGATION (REACTOME)
  3. SELENOCYSTEINE SYNTHESIS (REACTOME)
  4. FORMATION OF A POOL OF FREE 40S SUBUNITS (REACTOME)
  5. CYTOPLASMIC RIBOSOMAL PROTEINS (WIKIPATHWAYS)

Comparison with thresholded analysis from Assignment 2

GSEA found fewer significantly enriched gene sets than differentially-expressed genes found via thresholded analysis at the same p-value of 0.05, however, this is not an appropriate comparison as we are comparing gene sets to genes.

GSEA identified similar upregulated and downregulated pathways. The top hits for upregulated processes and pathways are also related to interferon signaling and the viral defence response, from two of the same data sources, Reactome and GO:BP. The top hits for downregulated processes and pathways concern metabolism and translation, from two of the same data sources, Reactome and WikiPathways. Some of the exact same processes and pathways appear in both lists.

This qualititative comparison is not straightforward as I am not comparing the full lists, just the top results. Furthermore, the data sources used are different. MSigDB is only part of the source used for GSEA. The results from g:Profiler also included one query with the entire differentially-expressed gene list, which has no equivalent in GSEA.

Visualize your Gene set Enrichment Analysis in Cytoscape

I am using Cytoscape 3.9.1 for Windows 64-bit. Through the App Manager, I installed EnrichmentMap Pipeline Collection 1.1.0, which also installs the following apps (also cited in the references):

  • clusterMaker2 2.2
  • AutoAnnotate 1.3.5
  • EnrichmentMap 3.3.3
  • WordCloud 3.1.4
  • NetworkAnalyzer 4.4.8
  • FileTransfer 1.2
  • cyChart 0.3.1

The following files were copied to a separate Windows folder, since I am running Docker in Windows Subsystem for Linux 2 and not mounting a Windows folder.

  • Human_GOBP_AllPathways_no_GO_iea_April_01_2022_symbol.gmt
  • GSEA_GSE185657.rnk produced by this notebook and input to GSEA
  • the entire edb directory from GSEA output

The following command was executed in Cytoscape’s command line, with all specified paths prefixed with the absolute path of the containing Windows directory, backslashes escaped. This command and the specified thresholds were obtained from these course notes.

enrichmentmap build analysisType="gsea" gmtFile="Human_GOBP_AllPathways_no_GO_iea_April_01_2022_symbol.gmt" pvalue=0.05 qvalue=0.25 similaritycutoff=0.375 coefficients=COMBINED ranksDataset1="edb\\GSEA_GSE185657.rnk" enrichmentsDataset1="edb\\results.edb" filterByExpressions=false expressionDataset1="GSEA_GSE185657.rnk"

To make it more explicit from that command, the thresholds used are:

  • pvalue = 0.05
  • qvalue = 0.25
  • similaritycutoff = 0.375
  • coefficients = COMBINED

Legend for all enrichment maps

Legend for all enrichment maps

Click on any of the enrichment maps to enlarge them.

Initial enrichment map

Initial Cytoscape enrichment map

According to “Tools > Analyze Network,” treating it as an undirected graph, there are 309 nodes and 1609 edges.

Annotated with clusters

Using AutoAnnotate on the entire network with largely default settings:

  • prevent cluster overlap
  • use clusterMaker
  • cluster algorithm: MCL Cluster
  • label column: GS_DESCR
  • label algorithm: WordCloud: Adjacent Words
  • max words per label: 3
  • minimum word occurence: 1
  • adjacent word bonus: 8

Annotated Cytoscape enrichment map

Manually-adjusted publication-ready enrichment map

Publication-ready Cytoscape enrichment map

Theme network

Theme network

The major themes match the top results from GSEA and thresholded analysis. They fit with the model and show established pathways. The two major connected groups of themes are one upregulated group related to telomeres, mitosis, and homologous recombination, basically cell division; and one group related to downregulation of metabolish in the electron transport chain and translation, while upregulating interferons as part of the viral response.

Interpretation and detailed view of results

The enrichment results support the conclusions in the original paper and are very similar to the results from Assignment 2. The authors “observed a typical anti-viral response in the bronchial model and a pro-fibrotic response in the alveolar model.” Like g:Profiler, GSEA also identifies interferon signaling and viral defence response as upregulated and metabolism and translation as downregulated. GSEA and the enrichment maps also indicate that the mitochondrial electron transport chain and steps of mitosis are affected. The largest downregulated cluster is actually related to the electron transport chain. The largest upregulated cluster is actually related to mitosis, which supports the authors’ conclusion regarding a pro-fibrotic response in alveolar cells.

Additional evidence in the literature

According to Reactome (2010), interferons “are cytokines that play a central role in initiating immune responses, especially antiviral.” A 2018 study by Qu et al noted that mitochondria serve as “a signaling hub for the innate immune response and facilitate downstream signaling leading to [interferon] synthesis.”

The samples in this experiment were treated with just SARS-CoV-2 spike protein. It is well-known that the spike protein of SARS-CoV-2 is used as a target for many current vaccines, especially the mRNA vaccines (CDC link). The spike protein of coronaviruses in general has been used as a target for vaccines and treatments since well before the current global COVID-19 pandemic (Du et al. 2009).

Dark matter analysis (option 3)

This code comes from the lecture slides.

library(GSA)
capture.output(genesets <- GSA.read.gmt("Human_GOBP_AllPathways_no_GO_iea_April_01_2022_symbol.gmt"),file="gsa_load.out")
names(genesets$genesets) <- genesets$geneset.names
#get all the GSEA directories
gsea_directories <- list.files(pattern = "\\.GseaPreranked")
gsea_dir <- gsea_directories[1]
#get the gsea result files
gsea_results_files <- list.files(path = gsea_dir, pattern = "gsea_report_*.*.tsv")
#there should be 2 gsea results files
enr_file1 <- read.table(file.path(gsea_dir,gsea_results_files[1]), 
                      header = TRUE, sep = "\t", quote="\"",  
                      stringsAsFactors = FALSE,row.names=1)
enr_file2 <- read.table(file.path(gsea_dir,gsea_results_files[2]), 
                      header = TRUE, sep = "\t", quote="\"",  
                      stringsAsFactors = FALSE,row.names=1)
#get the genes from the set of enriched pathways (no matter what threshold)
all_enr_genesets<- c(rownames(enr_file1), rownames(enr_file2))
genes_enr_gs <- c()
for(i in 1:length(all_enr_genesets)){
  current_geneset <- unlist(genesets$genesets[which(genesets$geneset.names %in% all_enr_genesets[i])]) 
  genes_enr_gs <- union(genes_enr_gs, current_geneset)
}
FDR_threshold <- 0.001
#get the genes from the set of enriched pathways (no matter what threshold)
all_sig_enr_genesets<- c(rownames(enr_file1)[
  which(enr_file1[,"FDR.q.val"]<=FDR_threshold)],
rownames(enr_file2)[which(enr_file2[,"FDR.q.val"]<=FDR_threshold)])
genes_sig_enr_gs <- c()
for(i in 1:length(all_sig_enr_genesets)){
  current_geneset <- unlist(genesets$genesets[which(genesets$geneset.names %in% all_sig_enr_genesets[i])])
  genes_sig_enr_gs <- union(genes_sig_enr_gs, current_geneset)
}
genes_all_gs <- unique(unlist(genesets$genesets))

How many significantly enriched genes are in the gene set?

length(genes_sig_enr_gs)
## [1] 1344

Venn diagram of genes analyzed

if (!require(VennDiagram)) {
  install.packages("VennDiagram")
}
library(VennDiagram)
draw.triple.venn(
  area1=length(genes_all_gs),
  area2=length(genes_enr_gs),
  area3 =length(rnk$gene),
  n12 = length(intersect(genes_all_gs,genes_enr_gs)),
  n13=length(intersect(genes_all_gs,rnk$gene)),
  n23 = length(intersect(genes_enr_gs,rnk$gene)),
  n123 = length(intersect(genes_all_gs,intersect(genes_enr_gs,rnk$gene))),
  category = c(
  "all genesets",
  "all enrichment results",
  "expression"),
  fill = c("red","green","blue"),
  cat.col = c("red","green","blue")
)

## (polygon[GRID.polygon.1], polygon[GRID.polygon.2], polygon[GRID.polygon.3], polygon[GRID.polygon.4], polygon[GRID.polygon.5], polygon[GRID.polygon.6], text[GRID.text.7], text[GRID.text.8], text[GRID.text.9], text[GRID.text.10], text[GRID.text.11], text[GRID.text.12], text[GRID.text.13], text[GRID.text.14])

Top dark matter genes

These are significantly differentially expressed but weren’t in the annotation data source used. Note that none of these top hits are missing HGNC symbols.

genes_no_annotation <- setdiff(rnk$gene, genes_all_gs)
ranked_gene_no_annotation <- rnk[which(rnk$gene %in% genes_no_annotation),]
knitr::kable(ranked_gene_no_annotation[1:10,], type="html")
Row.names gene logFC AveExpr t P.Value adj.P.Val B rank
11213 ENSG00000188707 ZBED6CL 1.687004 1.338325 3.665809 0.0011895 0.8315623 -2.602529 2.924641
1492 ENSG00000090372 STRN4 -6.626449 27.454873 -3.549819 0.0015910 0.8315623 -2.706156 -2.798341
6511 ENSG00000145337 PYURF -15.009661 124.198252 -3.536024 0.0016467 0.8315623 -2.718508 -2.783375
12175 ENSG00000210194 MT-TE -2.960040 6.106249 -3.474690 0.0019187 0.8315623 -2.773484 -2.716985
6141 ENSG00000141854 MISP3 -3.381942 6.542874 -3.411846 0.0022427 0.8315623 -2.829890 -2.649231
14557 ENSG00000285230 RALY-AS1 -3.366286 5.865339 -3.389710 0.0023690 0.8315623 -2.849771 -2.625436
12854 ENSG00000236432 MFF-DT 1.260441 1.795750 3.322266 0.0027979 0.8315623 -2.910370 2.553169
12468 ENSG00000224281 SLC25A5-AS1 -1.579658 1.870118 -3.286434 0.0030554 0.8315623 -2.942571 -2.514925
9897 ENSG00000175575 PAAF1 -8.386556 41.826509 -3.265846 0.0032137 0.8315623 -2.961073 -2.493000
12496 ENSG00000225269 LINC00705 2.104523 2.212895 3.188359 0.0038831 0.8315623 -3.030678 2.410816

Heatmap of significant genes not annotated to any pathway returned in the enrichment analysis

top_hits <- output_hits$gene[output_hits$P.Value < 0.05]
dark_matter <- setdiff(top_hits, genes_sig_enr_gs)
dark_matter_ensembl <- output_hits$Row.names[output_hits$gene %in% dark_matter]
heatmap_matrix_tophits <- t(scale(t(normalized_counts[which(rownames(normalized_counts) %in% dark_matter_ensembl),])))
heatmap_col <- circlize::colorRamp2(c(min(heatmap_matrix_tophits), 0, max(heatmap_matrix_tophits)), c("blue", "white", "red"))
heatmap <- ComplexHeatmap::Heatmap(as.matrix(heatmap_matrix_tophits),
  cluster_rows = TRUE, show_row_dend = TRUE,
  cluster_columns = FALSE,show_column_dend = FALSE,
  col=heatmap_col,show_column_names = TRUE,
  show_row_names = FALSE,show_heatmap_legend = TRUE)
heatmap

Heatmap of significant genes not annotated to any pathway in the entire set of pathways

top_hits <- output_hits$gene[output_hits$P.Value < 0.05]
dark_matter <- setdiff(top_hits, genes_all_gs)
dark_matter_ensembl <- output_hits$Row.names[output_hits$gene %in% dark_matter]
heatmap_matrix_tophits <- t(scale(t(normalized_counts[which(rownames(normalized_counts) %in% dark_matter_ensembl),])))
heatmap_col <- circlize::colorRamp2(c(min(heatmap_matrix_tophits), 0, max(heatmap_matrix_tophits)), c("blue", "white", "red"))
heatmap <- ComplexHeatmap::Heatmap(as.matrix(heatmap_matrix_tophits),
  cluster_rows = TRUE, show_row_dend = TRUE,
  cluster_columns = FALSE,show_column_dend = FALSE,
  col=heatmap_col,show_column_names = TRUE,
  show_row_names = FALSE,show_heatmap_legend = TRUE)
heatmap

References

Centers for Disease Control and Prevention. Understanding mRNA COVID-19 Vaccines (2022). Retrieved 2022 Apr 5 from https://www.cdc.gov/coronavirus/2019-ncov/vaccines/different-vaccines/mrna.html

Du, L., He, Y., Zhou, Y. et al. The spike protein of SARS-CoV — a target for vaccine and therapeutic development. Nat Rev Microbiol 7, 226–236 (2009). https://doi.org/10.1038/nrmicro2090

Garapati, PV. Interferon Signaling (2010). Reactome. Retrieved 2022 Apr 5 from https://reactome.org/content/detail/R-HSA-913531

GeneSetEnrichmentAnalysisWiki. Data formats. 2020. Retrieved 2022 Apr 5 from https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats

Isserlin R. Enrichment Map Analysis Pipeline. 2020. Retrieved 2022 Apr 5 from https://baderlab.github.io/Cytoscape_workflows/EnrichmentMapPipeline/Protocol2_createEM.html

Merico D, Isserlin R, Stueker O, Emili A, Bader GD. Enrichment map: a network-based method for gene-set enrichment visualization and interpretation. PLoS One. 2010 Nov 15;5(11):e13984. doi: 10.1371/journal.pone.0013984. PMID: 21085593; PMCID: PMC2981572.

Morris JH, Apeltsin L, Newman AM, Baumbach J, Wittkop T, Su G, Bader GD, Ferrin TE. clusterMaker: a multi-algorithm clustering plugin for Cytoscape. BMC Bioinformatics. 2011 Nov 9;12:436. doi: 10.1186/1471-2105-12-436. PMID: 22070249; PMCID: PMC3262844.

Oesper L, Merico D, Isserlin R, Bader GD. WordCloud: a Cytoscape plugin to create a visual semantic summary of networks. Source Code Biol Med. 2011 Apr 7;6:7. doi: 10.1186/1751-0473-6-7. PMID: 21473782; PMCID: PMC3083346.

Rahman M, Irmler M, Keshavan S, Introna M et al. Differential Effect of SARS-CoV-2 Spike Glycoprotein 1 on Human Bronchial and Alveolar Lung Mucosa Models: Implications for Pathogenicity. Viruses 2021 Dec 17;13(12). PMID: 34960806

Reimand J, Isserlin R, Voisin V, Kucera M, Tannus-Lopes C, Rostamianfar A, Wadi L, Meyer M, Wong J, Xu C, Merico D, Bader GD. Pathway enrichment analysis and visualization of omics data using g:Profiler, GSEA, Cytoscape and EnrichmentMap. Nat Protoc. 2019 Feb;14(2):482-517. doi: 10.1038/s41596-018-0103-9. PMID: 30664679; PMCID: PMC6607905.

Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 2003 Nov;13(11):2498-504. doi: 10.1101/gr.1239303. PMID: 14597658; PMCID: PMC403769.

Qu C, Zhang S, Wang W, Li M, Wang Y, van der Heijde-Mulder M, Shokrollahi E, Hakim MS, Raat NJH, Peppelenbosch MP, Pan Q. Mitochondrial electron transport chain complex III sustains hepatitis E virus replication and represents an antiviral target. FASEB J. 2019 Jan;33(1):1008-1019. doi: 10.1096/fj.201800620R. Epub 2018 Aug 2. PMID: 30070932; PMCID: PMC6355081.

1. Allaire J, Xie Y, McPherson J, Luraschi J, Ushey K, Atkins A, et al. Rmarkdown: Dynamic documents for r. 2021.
2. Xie Y, Dervieux C, Riederer E. R markdown cookbook. Boca Raton, Florida: Chapman; Hall/CRC; 2020.
3. Zhu Y, Davis S, Stephens R, Meltzer PS, Chen Y. GEOmetadb: Powerful alternative search engine for the gene expression omnibus. Bioinformatics (Oxford, England). 2008;24:2798–800.
4. Davis S, Meltzer P. GEOquery: A bridge between the gene expression omnibus (GEO) and BioConductor. Bioinformatics. 2007;14:1846–7.
5. Durinck S, Moreau Y, Kasprzyk A, Davis S, De Moor B, Brazma A, et al. BioMart and bioconductor: A powerful link between biological databases and microarray data analysis. Bioinformatics. 2005;21:3439–40.
6. Robinson MD, McCarthy DJ, Smyth GK. edgeR: A bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26:139–40.
7. Xie Y. Knitr: A general-purpose package for dynamic report generation in r. 2021.
8. Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research. 2015;43:e47.
9. Gu Z, Gu L, Eils R, Schlesner M, Brors B. Circlize implements and enhances circular visualization in r. Bioinformatics. 2014;30:2811–2.
10. Gu Z, Eils R, Schlesner M. Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics. 2016.
11. Efron B, Tibshirani R. GSA: Gene set analysis. 2019.
12. Chen H. VennDiagram: Generate high-resolution venn and euler plots. 2021.

Journal

GitHub wiki